@DrakeLin DrakeLin commented Jan 23, 2026

🥞 Stacked PR

Use this link to review incremental changes.


  • Import StatisticsCollector in parquet.rs
  • Add stats field to DataFileMetadata with with_stats() method
  • Update as_record_batch to use full stats if available
  • Update write_parquet_file to collect and attach stats
  • Update mod.rs write_parquet to pass stats_columns
  • Update write tests to expect full stats output
  • Fix write-table example API call

What changes are proposed in this pull request?

How was this change tested?

codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 81.87192% with 184 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.57%. Comparing base (d4ecc0a) to head (f488959).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/engine/default/stats.rs | 79.13% | 141 Missing and 9 partials ⚠️ |
| kernel/src/transaction/mod.rs | 70.00% | 15 Missing and 9 partials ⚠️ |
| kernel/src/engine/default/parquet.rs | 90.38% | 1 Missing and 4 partials ⚠️ |
| kernel/src/table_configuration.rs | 0.00% | 2 Missing ⚠️ |
| ffi/src/transaction/write_context.rs | 91.66% | 0 Missing and 1 partial ⚠️ |
| kernel/src/snapshot.rs | 80.00% | 0 Missing and 1 partial ⚠️ |
| uc-catalog/src/lib.rs | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1666      +/-   ##
==========================================
- Coverage   84.65%   84.57%   -0.08%     
==========================================
  Files         123      124       +1     
  Lines       34109    35063     +954     
  Branches    34109    35063     +954     
==========================================
+ Hits        28875    29655     +780     
- Misses       3905     4059     +154     
- Partials     1329     1349      +20     

@DrakeLin DrakeLin force-pushed the stack/stats-integration branch 11 times, most recently from 6c7e071 to 6b1b619 Compare January 23, 2026 23:26
- Add stats_columns parameter to write_parquet_file trait
- Add stats_schema(), stats_columns(), get_clustering_columns() to Transaction
- Add stats_columns to WriteContext
- Update get_write_context() to take engine parameter
- Add clustering column support to expected_stats_schema()
- Add StatisticsCollector struct with new(), update(), finalize()
- Track numRecords across multiple RecordBatches
- Output StructArray with {numRecords, tightBounds}
- Basic unit tests for single/multiple batches

This is the foundation for full stats collection, adding column-level
stats (nullCount, minValues, maxValues) in subsequent PRs.
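
The new() / update() / finalize() lifecycle described in this commit can be sketched without Arrow as follows. This is only an illustration under stated assumptions, not the code in kernel/src/engine/default/stats.rs: `Batch`, `FileStats`, and `RecordCounter` are hypothetical names, a `Vec<Option<i64>>` stands in for a RecordBatch, and `tight_bounds` is assumed to be `true` for a freshly written file.

```rust
/// Hypothetical stand-in for an Arrow RecordBatch: a single nullable i64 column.
pub struct Batch {
    pub values: Vec<Option<i64>>,
}

/// Illustrative file-level output, mirroring the {numRecords, tightBounds} struct.
#[derive(Debug, PartialEq)]
pub struct FileStats {
    pub num_records: u64,
    pub tight_bounds: bool,
}

/// Sketch of a collector that is created once per file and fed every batch.
#[derive(Default)]
pub struct RecordCounter {
    num_records: u64,
}

impl RecordCounter {
    pub fn new() -> Self {
        Self::default()
    }

    /// Fold one batch into the running row count.
    pub fn update(&mut self, batch: &Batch) {
        self.num_records += batch.values.len() as u64;
    }

    /// Consume the collector and emit the statistics for the whole file.
    /// Assumption: bounds are reported as tight for a newly written file.
    pub fn finalize(self) -> FileStats {
        FileStats {
            num_records: self.num_records,
            tight_bounds: true,
        }
    }
}
```

Because `finalize()` takes `self` by value, the collector cannot be updated again after the statistics have been emitted.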
- Add null count tracking for all columns
- Support nested struct null counts
- Merge null counts across multiple batches
- Only collect for columns in stats_columns
- Tests for null counting across batches
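
A rough sketch of the null-count merging listed above, restricted to the columns named in stats_columns. The types here are assumptions made for illustration: `Batch` is a plain map from column name to nullable values rather than an Arrow batch, `NullCounter` is a hypothetical name, and nested struct columns are not modelled.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical batch: column name -> nullable values (flat columns only).
pub type Batch = BTreeMap<String, Vec<Option<i64>>>;

pub struct NullCounter {
    /// Columns for which statistics were requested.
    stats_columns: HashSet<String>,
    /// Running null count per tracked column.
    null_counts: BTreeMap<String, u64>,
}

impl NullCounter {
    pub fn new(stats_columns: &[&str]) -> Self {
        Self {
            stats_columns: stats_columns.iter().map(|c| c.to_string()).collect(),
            null_counts: BTreeMap::new(),
        }
    }

    /// Add this batch's null counts to the running totals,
    /// skipping any column that is not in stats_columns.
    pub fn update(&mut self, batch: &Batch) {
        for (name, values) in batch {
            if !self.stats_columns.contains(name) {
                continue;
            }
            let nulls = values.iter().filter(|v| v.is_none()).count() as u64;
            *self.null_counts.entry(name.clone()).or_insert(0) += nulls;
        }
    }

    /// Return the merged null counts for the whole file.
    pub fn finalize(self) -> BTreeMap<String, u64> {
        self.null_counts
    }
}
```

Merging is a per-column sum, so the result does not depend on how the rows were split into batches.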
- Add min/max tracking for all supported types
- Primitive types (int8-64, uint8-64, float32/64)
- Date, timestamp with all time units
- Decimal128
- String types with truncation to 32 chars
- Merge min/max across multiple batches
- Tests for min/max across single and multiple batches
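
The min/max merging and string truncation mentioned above might look roughly like this. `MinMax` and `truncate_prefix` are hypothetical names, the slice of `Option<T>` stands in for an Arrow array, and NaN handling for floats is left out. Only the minimum side of truncation is shown: a prefix never sorts after the full string, so it remains a valid lower bound, whereas a truncated maximum would need to be rounded up, which this sketch does not attempt.

```rust
/// Running min/max for one column; both stay None until a non-null value is seen.
#[derive(Debug, PartialEq)]
pub struct MinMax<T> {
    pub min: Option<T>,
    pub max: Option<T>,
}

impl<T: PartialOrd + Clone> MinMax<T> {
    pub fn new() -> Self {
        Self { min: None, max: None }
    }

    /// Merge one batch of nullable values into the running bounds; nulls are skipped.
    pub fn update(&mut self, values: &[Option<T>]) {
        for v in values.iter().flatten() {
            if self.min.as_ref().map_or(true, |m| v < m) {
                self.min = Some(v.clone());
            }
            if self.max.as_ref().map_or(true, |m| v > m) {
                self.max = Some(v.clone());
            }
        }
    }
}

/// Keep at most `max_chars` characters of `s`, cutting on a character
/// boundary so the result is always valid UTF-8.
pub fn truncate_prefix(s: &str, max_chars: usize) -> String {
    s.chars().take(max_chars).collect()
}
```

Counting characters rather than bytes avoids splitting a multi-byte character; whether the PR measures the 32-character limit in characters or bytes is not stated here.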
- Add NullBuffer mask parameter to update()
- Only count masked-in rows for numRecords
- Only count nulls in masked-in rows for nullCount
- Filter column by mask before computing min/max
- Tests for mask behavior with min/max and null counting

This enables deletion vector support where masked-out rows
should not contribute to file statistics.
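
To illustrate the mask semantics described above, here is a minimal sketch in which masked-out rows contribute to none of the statistics. It assumes a `&[bool]` slice in place of Arrow's NullBuffer, with `true` meaning the row is masked in; `MaskedStats` is a hypothetical name and only a single i64 column is tracked.

```rust
/// Illustrative statistics for one nullable i64 column.
#[derive(Debug, Default, PartialEq)]
pub struct MaskedStats {
    pub num_records: u64,
    pub null_count: u64,
    pub min: Option<i64>,
    pub max: Option<i64>,
}

impl MaskedStats {
    /// Fold one batch into the statistics. When `mask` is Some, only rows whose
    /// mask entry is `true` are counted; when it is None, every row is counted.
    pub fn update(&mut self, values: &[Option<i64>], mask: Option<&[bool]>) {
        if let Some(m) = mask {
            assert_eq!(m.len(), values.len(), "mask and column lengths must match");
        }
        for (i, value) in values.iter().enumerate() {
            let selected = mask.map_or(true, |m| m[i]);
            if !selected {
                // Masked-out row (for example, deleted): ignore it entirely.
                continue;
            }
            self.num_records += 1;
            match value {
                None => self.null_count += 1,
                Some(v) => {
                    self.min = Some(self.min.map_or(*v, |m| m.min(*v)));
                    self.max = Some(self.max.map_or(*v, |m| m.max(*v)));
                }
            }
        }
    }
}
```

Applying the mask before every statistic, rather than only before min/max, keeps numRecords, nullCount, and the bounds consistent with the same set of visible rows.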
@DrakeLin DrakeLin force-pushed the stack/stats-integration branch from 6b1b619 to 9849c2c Compare January 23, 2026 23:29
Labels

breaking-change: Change that requires a major version bump